Cross-Lingual Word Representations via Spectral Graph Embeddings
Authors
Abstract
Cross-lingual word embeddings are used for cross-lingual information retrieval and domain adaptation. In this paper, we extend Eigenwords, spectral monolingual word embeddings based on canonical correlation analysis (CCA), to cross-lingual settings with sentence alignment. To incorporate cross-lingual information, CCA is replaced with its generalization based on spectral graph embeddings. The proposed method, which we refer to as Cross-Lingual Eigenwords (CL-Eigenwords), is fast and scalable, computing distributed representations of words via eigenvalue decomposition. Numerical experiments on English-Spanish word translation tasks show that CL-Eigenwords is competitive with state-of-the-art cross-lingual word embedding methods.
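As a rough illustration of the monolingual starting point mentioned in the abstract, the sketch below computes CCA-style word embeddings from a word-by-context co-occurrence matrix using a singular value decomposition. It is a simplified toy under stated assumptions, not the authors' implementation: the function name `cca_word_embeddings`, the use of row and column sums as diagonal scaling, and the example corpus are all hypothetical, and the cross-lingual generalization via spectral graph embeddings is not reproduced here.

```python
import numpy as np

def cca_word_embeddings(tokens, dim=2, window=1):
    """Toy CCA-style embeddings between words and their context words.

    Assumption: covariances are approximated by diagonal matrices built
    from row and column sums of the co-occurrence counts.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    offsets = [o for o in range(-window, window + 1) if o != 0]

    # C[w, k*V + c] counts how often word c appears at the k-th
    # context offset of word w.
    C = np.zeros((V, len(offsets) * V))
    for t, w in enumerate(tokens):
        for k, o in enumerate(offsets):
            if 0 <= t + o < len(tokens):
                C[index[w], k * V + index[tokens[t + o]]] += 1.0

    # Diagonal scaling terms; guard against empty rows or columns.
    dw = C.sum(axis=1)
    dc = C.sum(axis=0)
    dw[dw == 0] = 1.0
    dc[dc == 0] = 1.0

    # Scaled co-occurrence matrix: D_w^{-1/2} C D_c^{-1/2}.
    M = C / np.sqrt(dw)[:, None] / np.sqrt(dc)[None, :]

    # Leading singular vectors give the canonical directions.
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    emb = U[:, :dim] / np.sqrt(dw)[:, None]
    return vocab, emb, s[:dim]

tokens = "the cat sat on the mat the dog sat on the rug".split()
vocab, emb, s = cca_word_embeddings(tokens, dim=2, window=1)
```

Each row of `emb` is the embedding of the corresponding entry in `vocab`. On a real corpus the co-occurrence matrix would be sparse, and a truncated or randomized SVD would replace the dense decomposition used here.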
Similar resources
Learning Cross-lingual Word Embeddings via Matrix Co-factorization
A joint-space model for cross-lingual distributed representations generalizes language-invariant semantic features. In this paper, we present a matrix co-factorization framework for learning cross-lingual word embeddings. We explicitly define monolingual training objectives in the form of matrix decomposition, and induce cross-lingual constraints for simultaneously factorizing monolingual matric...
Wiktionary-Based Word Embeddings
Vectorial representations of words have grown remarkably popular in natural language processing and machine translation. The recent surge in deep learning-inspired methods for producing distributed representations has been widely noted even outside these fields. Existing representations are typically trained on large monolingual corpora using context-based prediction models. In this paper, we p...
A survey of cross-lingual embedding models
Cross-lingual embedding models allow us to project words from different languages into a shared embedding space. This allows us to apply models trained on languages with a lot of data, e.g. English to low-resource languages. In the following, we will survey models that seek to learn cross-lingual embeddings. We will discuss them based on the type of approach and the nature of parallel data that...
BilBOWA: Fast Bilingual Distributed Representations without Word Alignments
We introduce BilBOWA (Bilingual Bag-of-Words without Alignments), a simple and computationally efficient model for learning bilingual distributed representations of words which can scale to large monolingual datasets and does not require word-aligned parallel training data. Instead it trains directly on monolingual data and extracts a bilingual signal from a smaller set of raw-text sentence-alig...
ConceptNet at SemEval-2017 Task 2: Extending Word Embeddings with Multilingual Relational Knowledge
This paper describes Luminoso’s participation in SemEval 2017 Task 2, “Multilingual and Cross-lingual Semantic Word Similarity”, with a system based on ConceptNet. ConceptNet is an open, multilingual knowledge graph that focuses on general knowledge that relates the meanings of words and phrases. Our submission to SemEval was an update of previous work that builds high-quality, multilingual wor...